An Information Theoretic Feature Selection Framework for Big Data under Apache Spark

Authors

  • Sergio Ramírez-Gallego
  • Héctor Mouriño-Talín
  • David Martínez-Rego
  • Verónica Bolón-Canedo
  • José Manuel Benítez
  • Amparo Alonso-Betanzos
  • Francisco Herrera
Abstract

With the advent of extremely high-dimensional datasets, dimensionality reduction techniques are becoming mandatory. Among the many techniques available, feature selection has been attracting growing interest as an important tool for identifying relevant features in huge datasets (huge both in number of instances and in number of features). The purpose of this work is to demonstrate that standard feature selection methods can be parallelized on Big Data platforms like Apache Spark, boosting both performance and accuracy. We thus propose a distributed implementation of a generic feature selection framework which includes a wide group of well-known Information Theoretic methods. Experimental results on a wide set of real-world datasets show that our distributed framework is capable of dealing with ultra-high-dimensional datasets, as well as those with a huge number of samples, in a short period of time, outperforming the sequential version in all the cases studied.
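As a rough illustration of the kind of information-theoretic criterion such a framework builds on, the sketch below ranks discrete features by their mutual information with the class label. It is a minimal single-machine example in plain Python, not the authors' Spark implementation; the function names `mutual_information` and `rank_features` and the toy data are assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Estimate I(X;Y) in bits from two equal-length sequences of discrete values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # joint counts of (x, y) pairs
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    # sum over observed pairs of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def rank_features(columns, labels):
    """Return feature names sorted by decreasing mutual information with the labels."""
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: one feature duplicates the label, the other is independent of it.
labels = [0, 0, 1, 1]
columns = {
    "copy_of_label": [0, 0, 1, 1],
    "noise": [0, 1, 0, 1],
}
ranking = rank_features(columns, labels)
```

In this toy case the feature that duplicates the label ranks first and the independent one last. Each feature's score depends only on that feature's column and the labels, so the scoring step can be computed independently per feature, which is presumably what makes criteria of this kind amenable to parallelization.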

Similar resources

A Feature Selection Method for Large-Scale Network Traffic Classification Based on Spark

Currently, with the rapid increase of data scales in network traffic classification, how to select traffic features efficiently is becoming a big challenge. Although a number of traditional feature selection methods using the Hadoop-MapReduce framework have been proposed, the execution time was still unsatisfactory, with numerous iterative computations during the processing. To address this is...

A comparison on scalability for batch big data processing on Apache Spark and Apache Flink

Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, Calle Periodista Daniel Saucedo Aranda, 18071 Granada, Spain. Abstract: The large amounts of data have created a need for new fram...

Parallel Large-Scale Attribute Reduction on Cloud Systems

The rapid growth of emerging information technologies and application patterns in modern society, e.g., Internet, Internet of Things, Cloud Computing and Tri-network Convergence, has caused the advent of the era of big data. Big data contains huge values, however, mining knowledge from big data is a tremendously challenging task because of data uncertainty and inconsistency. Attribute reduction...

Evolutionary Feature Selection for Big Data Classification: A MapReduce Approach

Nowadays, many disciplines have to deal with big datasets that additionally involve a high number of features. Feature selection methods aim at eliminating noisy, redundant, or irrelevant features that may deteriorate the classification performance. However, traditional methods lack enough scalability to cope with datasets of millions of instances and extract successful results in a delimited time...

DduP - Towards a Deduplication Framework Utilising Apache Spark

This paper is about a new framework called DeduPlication (DduP). DduP aims to solve large scale deduplication problems on arbitrary data tuples. DduP tries to bridge the gap between big data, high performance and duplicate detection. At the moment a first prototype exists but the overall project status is work in progress. DduP utilises the promising successor of Apache Hadoop MapReduce [Had14]...


Journal title:
  • CoRR

Volume abs/1610.04154

Publication date 2016